{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Copyright &copy; 2020 The Alibaba Authors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 载图\n",
    "\n",
    "GraphScope 以 [属性图](https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model) 建模图数据。 图上的点/边都带有标签（label），每个标签都可能带有许多属性（property)。\n",
    "\n",
    "在这个教程中，我们将展示 GraphScope 如何载入一张图，包括\n",
    "\n",
    "- 如何快速载入内置数据集\n",
    "- 如何配置图的数据模型（schema）\n",
    "- 从多种存储中载图\n",
    "- 从磁盘中序列化/反序列化图"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 前期准备\n",
    "首先，创建会话并导入相关的包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install graphscope package if you are NOT in the Playground\n",
    "!pip3 install graphscope"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import graphscope\n",
    "\n",
    "graphscope.set_option(show_log=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 载入内置数据集\n",
    "GraphScope 内置了一组流行的数据集，以及载入他们的工具函数，帮助用户更容易的上手。\n",
    "\n",
    "来看一个例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphscope.dataset import load_ldbc\n",
    "\n",
    "graph = load_ldbc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在单机模式下，GraphScope 会将数据文件下载到 `${HOME}/.graphscope/dataset`，并且会保留以供将来使用。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 载入自己的数据集\n",
    "\n",
    "然而，更常见的情况是用户需要使用自己的数据集，并做一些数据分析的工作。\n",
    "\n",
    "我们提供了一个函数用来定义一个属性图的模型（schema），并以将属性图载入 GraphScope：\n",
    "\n",
    "首先建立一个空图："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import graphscope\n",
    "from graphscope.framework.loader import Loader\n",
    "\n",
    "graph = graphscope.g()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``Graph`` 有几个方法来配置：\n",
    "\n",
    "```python\n",
    "\n",
    "    def add_vertices(self, vertices, label=\"_\", properties=None, vid_field=0):\n",
    "        pass\n",
    "\n",
    "    def add_edges(self, edges, label=\"_e\", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):\n",
    "        pass\n",
    "```\n",
    "这些方法可以增量的构建一个属性图。\n",
    "\n",
    "我们将使用 `ldbc_sample` 里的文件做完此篇教程的示例。你可以在 [这里](https://github.com/GraphScope/gstest/tree/master/ldbc_sample) 找到源数据。\n",
    "\n",
    "你可以随时使用 ``print(graph.schema)`` 来查看图的模型."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build Vertex\n",
    "\n",
    "我们可以向图内添加一个点标签。相关的参数含义如下：\n",
    "\n",
    "#### vertices\n",
    "\n",
    "`Loader Object`，代表数据源，指示 ``graphscope`` 可以在哪里找到源数据，可以为文件路径，或者 numpy 数组等；\n",
    "\n",
    "一个简单的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "这将会从文件 `${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv` 载入数据，并且创建一个名为 ``_`` 的边，但是有不同的起始点标签和终点标签。\n",
    "\n",
    "#### label\n",
    "\n",
    "点标签的名字，默认为 ``_``.\n",
    "\n",
    "一张图中不能含有同名的标签，所以若有两个或以上的标签，用户必须指定标签名字。另外，总是给标签一个有意义的名字也有好处。\n",
    "\n",
    "可以为任何标识符 (identifier)。\n",
    "\n",
    "举个例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果与上一步结果除了标签名完全一致。\n",
    "\n",
    "#### properties\n",
    "\n",
    "一组属性名字。可选项，默认为 ``None``。\n",
    "\n",
    "属性名应当与数据中的首行表头中的名字相一致。\n",
    "\n",
    "如果省略或为 ``None``，除ID列之外的所有列都将会作为属性载入；如果为空列表 ``[]``，那么将不会载入任何属性；其他情况下，只会载入指定了的列作为属性。\n",
    "\n",
    "比如说："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# properties will be firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed\n",
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    "    properties=None,\n",
    ")\n",
    "\n",
    "# properties will be firstName, lastName\n",
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    "    properties=[\"firstName\", \"lastName\"],\n",
    ")\n",
    "\n",
    "# no properties\n",
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    "    properties=[],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### vid_field\n",
    "\n",
    "作为 ID 列的列名，默认为 0。此列将在载入边时被用做起始点 ID 或目标点 ID。\n",
    "\n",
    "其值可以是一个字符串，此时指代列名；\n",
    "\n",
    "或者可以是一个正整数，代表第几列 （从0开始）。\n",
    "\n",
    "默认为第0列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    vid_field=\"id\",\n",
    ")\n",
    "\n",
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    vid_field=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build Edge\n",
    "\n",
    "现在我们可以向图中添加一个边标签。\n",
    "\n",
    "#### edges\n",
    "\n",
    "与构建点标签一节中的 ``vertices`` 类似，为指示去哪里读数据的路径。\n",
    "\n",
    "让我们来看一个例子：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "# Note we already added a vertex label named 'person'.\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"person\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这将会载入一个标签名为 ``_e`` 的边，源节点标签和终点节点标签都为 ``person``，第一列作为起点的点ID，第二列作为终点的点ID。其他列都作为属性。\n",
    "\n",
    "#### label\n",
    "\n",
    "边的标签名，默认为 ``_e``。推荐总是使用一个有意义的标签名。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"knows\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"person\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### properties\n",
    "\n",
    "一列属性，默认为 ``None``。 意义与行为都和点中的一致。\n",
    "\n",
    "#### src_label and dst_label\n",
    "\n",
    "起点的标签名与终点的标签名。我们在上面的例子中已经看到过了，在那里将其赋值为 ``person``。这两者可以取不同的值。举例来说："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"comment\",\n",
    ")\n",
    "# Note we already added a vertex label named 'person'.\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"comment\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### src_field and dst_field\n",
    "\n",
    "起点的 ID 列名与终点的 ID 列名。 默认分别为 0 和 1。\n",
    "\n",
    "意义和表现与点中的 ``vid_field`` 类似，不同的是需要两列，一列为起点 ID， 一列为终点 ID。 以下是个例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"comment\",\n",
    ")\n",
    "\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"comment\",\n",
    "    src_field=\"Person.id\",\n",
    "    dst_field=\"Comment.id\",\n",
    ")\n",
    "# Or use the index.\n",
    "# graph = graph.add_edges(Loader('${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 高级用法\n",
    "\n",
    "这是一些用来处理特别简单或特别复杂的高级一些的用法。\n",
    "\n",
    "### 没有歧义时，自动推断点标签\n",
    "\n",
    "如果图中只存在一个点标签，那么可以省略指定点标签。\n",
    "GraphScope 将会推断起始点标签和终点标签为这一个点标签。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "# GraphScope will assign ``src_label`` and ``dst_label`` to ``person`` automatically.\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 从边中推断点\n",
    "\n",
    "如果用户的 ``add_edges`` 中 ``src_label`` 或者 ``dst_label`` 取值为图中不存在的点标签，``graphscope`` 会从边的端点中聚合出点表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "# Deduce vertex label `person` from the source and destination endpoints of edges.\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"person\",\n",
    ")\n",
    "\n",
    "graph = graphscope.g()\n",
    "# Deduce the vertex label `person` from the source endpoint,\n",
    "# and vertex label `comment` from the destination endpoint of edges.\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"comment\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 多种边关系\n",
    "\n",
    "在一些情况下，一种边的标签可能连接了两种及以上的点。例如，在下面的属性图中，有一个名为 ``likes`` 的边标签，\n",
    "连接了两种点标签，i.e., ``person`` -> ``likes`` <- ``comment`` and ``person`` -> ``likes`` <- ``post``。\n",
    "在这种情况下，可以添加两次名为 ``likes`` 的边，但是有不同的起始点标签和终点标签。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = graphscope.session(cluster_type=\"hosts\", num_workers=1, mode=\"lazy\")\n",
    "graph = sess.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"comment\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"post\",\n",
    ")\n",
    "\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"comment\",\n",
    ")\n",
    "\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"post\",\n",
    ")\n",
    "graph = sess.run(graph)\n",
    "print(graph.schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意:\n",
    "\n",
    "   1. 这个功能目前只在 `lazy` 会话中支持。\n",
    "   2. 对于同一个标签的多个定义，其属性列表的数量和类型应该一致，最好名字也一致，\n",
    "      因为同一个标签的所有定义的数据都将会被放入同一张表，属性名将会使用第一个定义中指定的名字。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 指定属性的数据类型\n",
    "\n",
    "GraphScope 可以从输入文件中推断点的类型，大部分情况下工作的很好。\n",
    "\n",
    "然而，用户有时需要更多的自定义能力。为了满足此种需求，可以在属性名之后加入一个额外类型的参数。像这样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g()\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"post\",\n",
    "    properties=[\"content\", (\"length\", \"int\")],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这将会将属性的类型转换为指定的类型，注意属性名字和类型需要在同一个元组中。\n",
    "\n",
    "在这里，属性 ``length`` 的类型将会是 ``int``，而默认不指定的话为 ``int64_t``。 常见的使用场景是指定 ``int``, ``int64_t``, ``float``, ``double``, ``str`` 等类型。\n",
    "\n",
    "\n",
    "### 图的其他参数\n",
    "\n",
    "类 ``Graph`` 有三个配置元信息的参数，分别为：\n",
    "\n",
    "- ``oid_type``, 可以为 ``int64_t`` 或 ``string``。 默认为 ``int64_t``，会有更快的速度，和使用更少的内存。当ID不能用 ``int64_t`` 表示时，才应该使用 ``string``。\n",
    "- ``directed``, bool, 默认为 ``True``. 指示载入无向图还是有向图。\n",
    "- ``generate_eid``, bool, 默认为 ``True``. 指示是否为每条边分配一个全局唯一的ID。\n",
    "\n",
    "\n",
    "### 完整的示例\n",
    "\n",
    "让我们写一个完整的图的定义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph = graphscope.g(oid_type=\"int64_t\", directed=True, generate_eid=True)\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"person\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"comment\",\n",
    ")\n",
    "graph = graph.add_vertices(\n",
    "    Loader(\"${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv\", delimiter=\"|\"),\n",
    "    label=\"post\",\n",
    ")\n",
    "\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"knows\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"person\",\n",
    ")\n",
    "graph = graph.add_edges(\n",
    "    Loader(\n",
    "        \"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv\",\n",
    "        delimiter=\"|\",\n",
    "    ),\n",
    "    label=\"likes\",\n",
    "    src_label=\"person\",\n",
    "    dst_label=\"comment\",\n",
    ")\n",
    "\n",
    "print(graph.schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里是一个更复杂的载入 LDBC-SNB 属性图的 [例子](https://github.com/alibaba/GraphScope/blob/main/python/graphscope/dataset/ldbc.py).\n",
    "\n",
    "## 从 Pandas 或 Numpy 中载图\n",
    "\n",
    "上文提到的数据源是一个 `Loader Object` 的类。``Loader`` 包含文件路径或者数据本身。\n",
    "``graphscope`` 支持从 ``pandas.DataFrame`` 或 ``numpy.ndarray`` 中载图，这可以使用户仅通过 Python 控制台便可以创建图。\n",
    "\n",
    "除了 Loader 外，其他属性，ID列，标签设置等都和之前提到的保持一致。\n",
    "\n",
    "### 从 Pandas 中载图"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])\n",
    "member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])\n",
    "group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])\n",
    "e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))\n",
    "df_group = pd.DataFrame(e_data, columns=[\"leader_id\", \"member_id\", \"group_size\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
    "avg_score = np.array(\n",
    "    [490.33, 164.5, 190.25, 762.0, 434.2, 513.0, 569.0, 25.0, 308.0, 87.0]\n",
    ")\n",
    "v_data = np.transpose(np.vstack([student_id, avg_score]))\n",
    "df_student = pd.DataFrame(v_data, columns=[\"student_id\", \"avg_score\"]).astype(\n",
    "    {\"student_id\": np.int64}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.\n",
    "# (for vertices, col_0 will be used as vertex_id by default)\n",
    "graph = graphscope.g().add_vertices(df_student).add_edges(df_group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 从 Numpy 中载图\n",
    "\n",
    "注意每个数组都代表一列，我们将其以 COO 矩阵的方式传入。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "array_group = [df_group[col].values for col in [\"leader_id\", \"member_id\", \"group_size\"]]\n",
    "array_student = [df_student[col].values for col in [\"student_id\", \"avg_score\"]]\n",
    "\n",
    "graph = graphscope.g().add_vertices(array_student).add_edges(array_group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loader 的变种\n",
    "\n",
    "当 ``loader`` 包含文件路径时，它可能仅包含一个字符串。\n",
    "文件路径应遵循 URI 标准。当收到包含文件路径的载图请求时， ``graphscope`` 将会解析 URI，调用相应的载图模块。\n",
    "\n",
    "目前, ``graphscope`` 支持多种数据源：本地, OSS，S3，和 HDFS:\n",
    "数据由 [Vineyard](https://github.com/v6d-io/v6d) 负责载入，``Vineyard``` 使用 [fsspec](https://github.com/intake/filesystem_spec) 解析不同的数据格式以及参数。任何额外的具体的配置都可以在Loader的可变参数列表中传入，这些参数会直接被传递到对应的存储类中。比如 ``host`` 和 ``port`` 之于 ``HDFS``，或者是 ``access-id``, ``secret-access-key`` 之于 oss 或 s3。\n",
    "\n",
    "```python\n",
    "\n",
    "    from graphscope.framework.loader import Loader\n",
    "\n",
    "    ds1 = Loader(\"file:///var/datafiles/group.e\")\n",
    "    ds2 = Loader(\"oss://graphscope_bucket/datafiles/group.e\", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')\n",
    "    ds3 = Loader(\"hdfs:///datafiles/group.e\", host='localhost', port='9000', extra_conf={'conf1': 'value1'})\n",
    "    d34 = Loader(\"s3://datafiles/group.e\", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})\n",
    "```\n",
    "用户可以方便的实现自己的driver来支持更多的数据源，比如参照 [ossfs](https://github.com/v6d-io/v6d/blob/main/modules/io/adaptors/ossfs.py) driver的实现方式。\n",
    "用户需要继承 ``AbstractFileSystem`` 类用来做scheme对应的resolver， 以及 ``AbstractBufferedFile``。用户仅需要实现 ``_upload_chunk``,\n",
    "``_initiate_upload`` and ``_fetch_range`` 这几个方法就可以实现基本的read，write功能。最后通过 ``fsspec.register_implementation('protocol_name', 'protocol_file_system')`` 注册自定义的resolver。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 序列化与反序列化 (仅 K8s 模式下)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当图的规模很大时，可能要花大量时间载入（可能多达几小时）。\n",
    "\n",
    "GraphScope 提供了序列化与反序列化图数据的功能，可以将载入的图以二进制的形式序列化到磁盘上，以及从这些文件反序列化为一张图。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 序列化\n",
    "\n",
    "`graph.serialize` 需要一个 `path` 的参数，代表写入二进制文件的路径。\n",
    "\n",
    "`graph.save_to('/tmp/seri')`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 反序列化\n",
    "\n",
    "`graph.load_from` 的参数类似 `graph.save_to`. 但是，其 `path` 参数必须和序列化时为 `graph.save_to` 提供的 `path` 参数完全一致，因为 GraphScope 依赖命名规则去找到所有文件，注意在序列化时，所有的工作者都将其自己所持有的图数据写到一个以自己的工作者ID结尾的文件中，所以在反序列化时的工作者数目也必须和序列化时的工作者数目 **完全一致**。\n",
    "\n",
    "`graph.load_from` 额外需要一个 `sess` 的参数，代表将反序列化后的图载入到此会话。\n",
    "\n",
    "```python\n",
    "import graphscope\n",
    "from graphscope import Graph\n",
    "sess = graphscope.session()\n",
    "deserialized_graph = Graph.load_from('/tmp/seri', sess)\n",
    "print(deserialized_graph.schema)\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
